Salesforce Data Cloud Ingestion from SharePoint

Application details

Technical considerations

This solution is designed for SharePoint Online (not SharePoint Server)
A client application must be registered in Microsoft Entra ID to use this application
The Mule application uses Microsoft Graph APIs to collect information and does not use the legacy SharePoint REST Services
One instance of the Mule application is deployed per SharePoint site and will monitor all Document Libraries in the site for changes
SharePoint content is delivered in the preferred MIME type when possible; otherwise it is delivered as PDF, HTML, as-is, or optionally as Base64-encoded text (in that order)
The Mule application does not support subscriptions for change notifications
The /ping endpoint makes an authenticated request to SharePoint by attempting to obtain a list of document libraries
The Mule application is designed to be stateless except for the full refresh scenario, which requires some state to be maintained to sequentially process multiple document libraries
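
The delivery-format fallback described above can be sketched as a small selection function. This is only an illustration, not the Mule application's actual logic: the function name, the convertible_to parameter, and the "base64" return label are hypothetical, and the way the optional Base64 step interacts with as-is delivery is an assumption (here Base64 replaces as-is delivery only when the caller opts in).

```python
from typing import Set

PDF = "application/pdf"
HTML = "text/html"


def choose_delivery(preferred: str, source: str, convertible_to: Set[str],
                    encode_binary: bool = False) -> str:
    """Pick a delivery format using the documented fallback order.

    preferred      -- MIME type requested by the caller
    source         -- MIME type of the file as stored in SharePoint
    convertible_to -- MIME types the file can be converted to (hypothetical input)
    encode_binary  -- mirrors the encodeBinaryContent flag (assumed semantics)
    """
    # 1. Preferred MIME type, if the file already has it or can be converted.
    if preferred == source or preferred in convertible_to:
        return preferred
    # 2. Otherwise fall back to PDF first, then HTML.
    for fallback in (PDF, HTML):
        if fallback in convertible_to:
            return fallback
    # 3. Last resort: Base64 text if requested, else the original bytes as-is.
    return "base64" if encode_binary else source
```

For example, a Word document that can be converted to PDF and HTML but is requested as text/plain would resolve to application/pdf under this sketch.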

Activity diagrams

The following activity diagrams illustrate the processing sequence for ingesting the metadata of unstructured content and for retrieving the content itself on demand.

Initial Load/Full Refresh Synchronous

Initial Load/Full Refresh Asynchronous

Incremental Load

Get Content

Processing logic

The primary handling and orchestration of unstructured metadata ingestion is implemented in the Salesforce Data Cloud Ingestion from SharePoint Process API. This process is described in more detail in the following sections.

Initial Load/Full Refresh Synchronous

This flow is triggered by the end user.

A user clicks the Refresh Now button on the UDLO page to initiate the request for a full refresh of resource metadata
Data Cloud invokes the Mule application without a continuation token to start the process
Mule application receives the request and will:
- Enumerate all libraries in the configured site (only on calls that do not include a continuation token)
- Create an object store entry with a list of all libraries (only on calls that do not include a continuation token)
- Filter out libraries with no content (only on calls that do not include a continuation token)
- Create an artificial continuation token to return to Data Cloud (virtual token) (only on calls that do not include a continuation token)
- Retrieve the site metadata from SharePoint Online
- Transform the results into the Data Cloud required format
- Maintain the state of which libraries have been completely fetched and which still remain
- Maintain the state of the SharePoint continuation token, which is issued by SharePoint for each library
Data Cloud invokes the Mule application in a loop to handle pagination, passing the continuation token from the previous response on each call, until all metadata has been retrieved
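
The synchronous flow above can be sketched as a single paged handler that keeps its state in an object store. This is a minimal sketch under stated assumptions, not the Mule implementation: the store is a plain dictionary, list_libraries and fetch_page are hypothetical callables standing in for the SharePoint calls, and the item_count field and response keys are invented for illustration.

```python
import uuid
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical page fetcher: (library_id, sharepoint_token) -> (items, next_token)
PageFetcher = Callable[[str, Optional[str]], Tuple[List[dict], Optional[str]]]


def full_refresh_page(store: Dict[str, dict],
                      list_libraries: Callable[[], List[dict]],
                      fetch_page: PageFetcher,
                      token: Optional[str] = None) -> dict:
    """Return one page of metadata plus the virtual continuation token."""
    if token is None:
        # First call only: enumerate libraries, drop empty ones, save state.
        libraries = [lib for lib in list_libraries() if lib["item_count"] > 0]
        token = uuid.uuid4().hex  # artificial (virtual) continuation token
        store[token] = {"remaining": [lib["id"] for lib in libraries],
                        "sp_token": None}
    state = store[token]
    if not state["remaining"]:
        del store[token]
        return {"items": [], "continuationToken": None}

    # Fetch the next page of the library currently being processed.
    library_id = state["remaining"][0]
    items, next_sp_token = fetch_page(library_id, state["sp_token"])
    if next_sp_token is None:
        state["remaining"].pop(0)  # this library is completely fetched
    state["sp_token"] = next_sp_token  # SharePoint token for the current library

    done = not state["remaining"]
    if done:
        del store[token]  # nothing left, so the state can be discarded
    return {"items": items, "continuationToken": None if done else token}
```

A caller in the role of Data Cloud would invoke full_refresh_page with token=None once, then keep passing back the returned continuationToken until it comes back as None.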

Initial Load/Full Refresh Asynchronous

This flow is triggered by an external application, such as Postman.

Mule application receives a request to perform an asynchronous refresh of all metadata and will:
- Enumerate all libraries in the configured site (only if the object store is empty - initialization)
- Create an object store entry with a list of all libraries (only if the object store is empty - initialization)
- Filter out libraries with no content (only if the object store is empty - initialization)
- Retrieve the site metadata from SharePoint Online
- Transform the results into the required format for the Data Cloud Ingestion API
- Send the transformed data to the ingestion endpoint
- Maintain the state of which libraries have been completely fetched and which still remain
- Maintain the state of the SharePoint continuation token, which is issued by SharePoint for each library
- Repeat the above steps for each library
If an existing asynchronous operation is running or no libraries with content are found, a 429 (Too Many Requests) HTTP status is returned
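
The guard against overlapping asynchronous refreshes might look like the following sketch. The class name, the process_library callable, the item_count field, and the 202 status for an accepted request are all assumptions made for illustration; only the 429 status for the rejected cases comes from the description above.

```python
import threading
from typing import Callable, List, Optional


class AsyncRefresh:
    """Runs at most one background refresh at a time (illustrative only)."""

    def __init__(self, list_libraries: Callable[[], List[dict]],
                 process_library: Callable[[dict], None]):
        self._list_libraries = list_libraries
        self._process_library = process_library
        self._lock = threading.Lock()
        self._running = False
        self.worker: Optional[threading.Thread] = None

    def start(self) -> int:
        """Return an HTTP-style status code for the refresh request."""
        with self._lock:
            if self._running:
                return 429  # a refresh is already in progress
            libraries = [lib for lib in self._list_libraries()
                         if lib["item_count"] > 0]
            if not libraries:
                return 429  # no libraries with content were found
            self._running = True
        self.worker = threading.Thread(target=self._run, args=(libraries,),
                                       daemon=True)
        self.worker.start()
        return 202  # assumed "accepted" status; not stated in the source

    def _run(self, libraries: List[dict]) -> None:
        try:
            for lib in libraries:
                # Fetch, transform, and publish one library's metadata.
                self._process_library(lib)
        finally:
            with self._lock:
                self._running = False
```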

Get Content

This flow is triggered by Data Cloud.

Data Cloud initiates the request to retrieve the content
Mule application receives the request to retrieve and stream the content from SharePoint Online
Mule application will attempt to transcode the file to the preferred MIME type as requested by Data Cloud and as supported by Microsoft Graph service

Important notes:

A resource identifier for content retrieval is a concatenation of the drive (library) identifier, a comma separator, and the internal resource identifier (for example, b!yit6plLgAkK3fK_nKKrZd7jE-m_vJGdOgFMp-7pHxbuaIfzv_0USQI1QqN5WM8NB,01FTGOZOM67OM7V6PF2JDK7VCQHM7DV6YZ).
Requesting binary content with the encodeBinaryContent flag set to true will disable streaming due to the nature of the Base64 encoding operation. This may result in request timeouts when attempting to encode very large files.
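
As an illustration of the resource identifier format, the sketch below splits an identifier into its drive and item parts and builds a Microsoft Graph download URL from them. The helper names are hypothetical, and the assumption that the application calls the Graph /drives/{drive-id}/items/{item-id}/content endpoint with a format query parameter is not confirmed by this document.

```python
from typing import Optional, Tuple
from urllib.parse import quote, urlencode

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def split_resource_id(resource_id: str) -> Tuple[str, str]:
    """Split '<drive id>,<item id>' at the first comma."""
    drive_id, sep, item_id = resource_id.partition(",")
    if not sep or not drive_id or not item_id:
        raise ValueError("expected '<drive id>,<item id>'")
    return drive_id, item_id


def content_url(resource_id: str, target_format: Optional[str] = None) -> str:
    """Build the download URL, optionally asking Graph to convert the file."""
    drive_id, item_id = split_resource_id(resource_id)
    # Keep '!' unescaped because SharePoint drive IDs commonly start with 'b!'.
    url = (f"{GRAPH_BASE}/drives/{quote(drive_id, safe='!')}"
           f"/items/{quote(item_id)}/content")
    if target_format:
        url += "?" + urlencode({"format": target_format})
    return url
```

Calling content_url with the example identifier above and target_format="pdf" yields a URL ending in /items/01FTGOZOM67OM7V6PF2JDK7VCQHM7DV6YZ/content?format=pdf.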

Incremental Load

Mule application runs a scheduler at a given frequency
If an entry does not exist in the object store, the Mule application will:
- Enumerate all libraries in the configured site (only if the object store is empty - initialization)
- For each library, call Delta Query with token=latest parameter to obtain the current/latest token
- Create an object store entry with a list of all libraries and the latest Delta token (only if the object store is empty - initialization)
- End the process, and the next scheduled execution will locate changes
If an entry exists in the object store, the Mule application will:
- Fetch the object store state for all libraries including the Delta token
- For each library, call Delta Query with the token parameter set to the value from the object store to obtain the changes since the last execution
- Publish the metadata to the ingestion API
- Update the object store per library with the most recent Delta token
- If Delta Query has paginated results, the Mule application will follow the "nextLink" until there are no more pages, publishing each page of results to the Data Cloud Ingestion API
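
The scheduled run described above can be sketched as follows. This is a simplified illustration, not the Mule flow: get_json and publish are hypothetical callables for the Graph request and the Ingestion API call, the store is a plain dictionary, and the sketch stores the whole @odata.deltaLink URL per library rather than extracting the token from it.

```python
from typing import Callable, Dict, List


def incremental_run(store: Dict[str, str],
                    list_libraries: Callable[[], List[dict]],
                    get_json: Callable[[str], dict],
                    publish: Callable[[List[dict]], None]) -> None:
    """One scheduler execution of the incremental load."""
    if not store:
        # Initialization: record the latest delta position for each library.
        for lib in list_libraries():
            page = get_json(f"/drives/{lib['id']}/root/delta?token=latest")
            store[lib["id"]] = page["@odata.deltaLink"]
        return  # the next scheduled execution will locate changes

    for library_id, delta_link in list(store.items()):
        url = delta_link
        while url:
            page = get_json(url)
            if page.get("value"):
                publish(page["value"])  # one publish call per page of changes
            if "@odata.deltaLink" in page:
                # Final page: remember where to resume on the next execution.
                store[library_id] = page["@odata.deltaLink"]
            url = page.get("@odata.nextLink")  # absent on the final page
```

Because the stored link is only updated when the final page arrives, a run that fails partway through a library would resume from the previous position on the next execution under this sketch.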

Success conditions

Upon successful completion, the following conditions will be met:

All metadata associated with unstructured content in the document libraries in SharePoint Online is retrieved and processed.
The full load of metadata is retrieved on demand.
An incremental load of metadata is uploaded to Data Cloud on a scheduled frequency.
Retrieval of content in PDF and HTML is supported.

Type	Application
Organization	MuleSoft
Published by	MuleSoft Solutions
Published on	Jan 21, 2025

Version
1.0.11
1.0.10
1.0.9
